Annotating foreign learners’ Czech

Author

  • Barbora Štindlová
Abstract

One of the challenges of contemporary corpus linguistics is the compilation and annotation of corpora consisting of texts produced by non-native speakers. In addition to morphosyntactic tagging and lemmatisation, such texts can be annotated with information relevant to the specific nonstandard use. Cases of deviant language use can be corrected and identified by a tag specifying the type of the error. Because of the properties of Czech, namely rich inflection, derivation, agreement, and a largely information-structure-driven constituent order, it is not straightforward to design an annotation scheme that satisfies all requirements for describing errors produced by non-native learners. Our proposal aims at an optimal solution that is still realistic given the annotation costs and the demands of the corpus users. After an overview of issues related to learner corpora in §2 and a brief introduction to the project of a learner corpus of Czech in §3, we present the issues of annotation in §4 and the concept of our annotation scheme in §5, followed by a description of the annotation process in §6.

Related resources

Building and Using Corpora of Non-Native Czech

Investigating language acquisition by non-native learners helps to understand important linguistic issues and develop teaching methods, better suited both to the specific target language and to the learner. These tasks can now be based on empirical evidence from learner corpora. A learner corpus consists of language produced by language learners, typically learners of a second or foreign langua...

Automatic Identification of Learners' Language Background Based on Their Writing in Czech

The goal of this study is to investigate whether learners’ written data in highly inflectional Czech can suggest a consistent set of clues for automatic identification of the learners’ L1 background. For our experiments, we use texts written by learners of Czech, which have been automatically and manually annotated for errors. We define two classes of learners: speakers of Indo-European languag...

Semantic Annotating of Czech Corpus via WSD

We would like to describe the relationship between word sense disambiguation (WSD) and language resources (LR) working with word senses. We discuss the problem of sense division and tagging. Exploiting specific features of the inflectional languages for WSD is encouraged. We present WSD methods for Czech ambiguous nouns. The advantage of these methods consists in reducing the manual work by usi...

Creating annotated resources for polarity classification in Czech

This paper presents the first steps towards reliable polarity classification based on Czech data. We describe a method for annotating Czech evaluative structures and build a standard unigram-based Naive Bayes classifier on three different types of annotated texts. Furthermore, we analyze existing results for both manual and automatic annotation, some of which are promising and close to the stat...

Perception of Czech Vowel Quantity by English Learners of Czech

Acquiring L2 vowel quantity can be difficult for native speakers of languages like English where vowel duration cues stress. This study tested whether English learners of Czech would categorize short and long vowels in a stressed or in an unstressed syllable differently than native listeners. The role of L2 experience was also explored. Results showed that the native and nonnative listeners did...

Publication date: 2010